
fix(store): scope FSM_SYNC_MODE=nosync to raft-apply callers only#600

Merged
bootjp merged 2 commits into main from perf/fsm-apply-group-commit
Apr 23, 2026

Conversation


@bootjp bootjp commented Apr 23, 2026

Summary

Follow-up to #592 addressing the codex P2 correctness finding: ELASTICKV_FSM_SYNC_MODE=nosync previously affected direct (non-raft) callers of store.ApplyMutations / store.DeletePrefixAt, which have no raft-log replay as a durability backstop. Splits the API so the knob only relaxes fsync on the raft-apply path.

  • ApplyMutations / DeletePrefixAt are now unconditionally pebble.Sync. Safe for any direct caller (catalog bootstrap, admin snapshots, migrations, tests).
  • ApplyMutationsRaft / DeletePrefixAtRaft are the new raft-apply entry points; they observe s.fsmApplyWriteOpts. Called only from kv/fsm.go.
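A minimal sketch of the split, using a stdlib-only stand-in for Pebble's write options (the real methods take a mutation batch and a `*pebble.WriteOptions`; everything here except the `fsmApplyWriteOpts` name is illustrative, not the actual store code):

```go
package main

import (
	"fmt"
	"os"
)

// writeOpts stands in for pebble's *pebble.WriteOptions:
// true ≈ pebble.Sync, false ≈ pebble.NoSync.
type writeOpts bool

type store struct {
	// Resolved once from ELASTICKV_FSM_SYNC_MODE at startup.
	fsmApplyWriteOpts writeOpts
}

// ApplyMutations is the direct entry point (catalog bootstrap, admin
// snapshots, migrations, tests). There is no raft log to replay after
// a crash, so it always fsyncs regardless of the knob.
func (s *store) ApplyMutations() writeOpts { return true }

// ApplyMutationsRaft is called only from the FSM apply loop, where the
// raft WAL fsync is the durability backstop, so NoSync is safe there.
func (s *store) ApplyMutationsRaft() writeOpts { return s.fsmApplyWriteOpts }

func main() {
	nosync := os.Getenv("ELASTICKV_FSM_SYNC_MODE") == "nosync"
	s := &store{fsmApplyWriteOpts: writeOpts(!nosync)}
	fmt.Println("direct sync:", s.ApplyMutations())
	fmt.Println("raft sync:", s.ApplyMutationsRaft())
}
```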

Code evidence for the bug

distribution/catalog.go:630 calls s.store.ApplyMutations directly from CatalogStore.applySaveMutations. main.go:749 wires distribution.NewCatalogStore around the raw raftGroupRuntime.store (a store.MVCCStore), so EnsureCatalogSnapshot -> CatalogStore.Save -> store.ApplyMutations is a production path that never goes through the raft apply loop. Under nosync, a crash before Pebble flush would lose acknowledged catalog writes with no raft entry to replay — exactly the failure mode codex P2 described.

Before / after per call site

| Call site | Before | After |
|---|---|---|
| `kv/fsm.go` (raft apply: raw / 1PC / prepare / commit / abort / DEL_PREFIX) | `s.fsmApplyWriteOpts` | `s.fsmApplyWriteOpts` (via `*Raft`) — unchanged |
| `distribution/catalog.go` `CatalogStore.applySaveMutations` | `s.fsmApplyWriteOpts` (UNSAFE) | `pebble.Sync` (always) |
| `adapter/*_test.go` direct `ApplyMutations` / `DeletePrefixAt` calls | `s.fsmApplyWriteOpts` | `pebble.Sync` (always) |
| `kv/shard_store.go`, `kv/leader_routed_store.go` wrappers | forwarded once | forwarded twice (both variants) |

Test plan

  • TestDirectApplyWriteOpts_AlwaysSync asserts the direct path resolves to pebble.Sync even when the FSM-apply mode is nosync.
  • TestDirectApplyMutations_NoSyncConfigured_StillWritesDurably exercises the public ApplyMutations / DeletePrefixAt entry points under a NoSync-configured store and confirms data is visible after a clean reopen.
  • Existing TestApplyMutations_NoSync* and BenchmarkApplyMutations_SyncMode renamed to *Raft to reflect that the knob now only governs the raft-apply path.
  • go test -race -count=1 ./... — all packages pass.
  • golangci-lint run ./... — 0 issues.

/gemini review
@codex review

Previously ApplyMutations/DeletePrefixAt used s.fsmApplyWriteOpts
unconditionally, which made ELASTICKV_FSM_SYNC_MODE=nosync affect direct
(non-raft) callers that do NOT have raft-log replay as a durability
backstop. Concrete production path:
distribution.EnsureCatalogSnapshot -> CatalogStore.Save ->
store.ApplyMutations. If the process crashed before Pebble flushed,
those acknowledged writes could be lost with no raft entry to re-apply.

This change splits the API:

- ApplyMutations / DeletePrefixAt: always pebble.Sync. Safe for any
  caller, including those that bypass raft (catalog bootstrap, admin
  snapshots, migrations, tests).
- ApplyMutationsRaft / DeletePrefixAtRaft: governed by
  s.fsmApplyWriteOpts (ELASTICKV_FSM_SYNC_MODE). Intended solely for the
  FSM apply loop; only the raft WAL fsync makes pebble.NoSync safe.

kv/fsm.go is the only caller of the new *Raft methods. CatalogStore and
every adapter test retain the always-sync ApplyMutations, so the
nosync opt-in can no longer silently weaken durability outside the
raft-apply path. MVCCStore / ShardStore / LeaderRoutedStore all
implement both variants; the in-memory mvccStore has no WAL so both
delegate to the same body.
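The wrapper forwarding can be sketched with a widened interface; the type names mirror the PR (MVCCStore, ShardStore) but the bodies are illustrative stand-ins, not the repo's code:

```go
package main

import "fmt"

// mutator sketches the widened store interface: both the always-sync
// direct variant and the knob-governed raft variant.
type mutator interface {
	ApplyMutations() string     // always durable
	ApplyMutationsRaft() string // governed by ELASTICKV_FSM_SYNC_MODE
}

// mvccStore models the in-memory store: it has no WAL, so both
// variants delegate to the same body.
type mvccStore struct{}

func (mvccStore) ApplyMutations() string     { return "applied" }
func (mvccStore) ApplyMutationsRaft() string { return "applied" }

// shardStore forwards both variants to the underlying store,
// preserving which entry point the caller chose.
type shardStore struct{ inner mutator }

func (s shardStore) ApplyMutations() string     { return s.inner.ApplyMutations() }
func (s shardStore) ApplyMutationsRaft() string { return s.inner.ApplyMutationsRaft() }

func main() {
	var m mutator = shardStore{inner: mvccStore{}}
	fmt.Println(m.ApplyMutations(), m.ApplyMutationsRaft())
}
```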

Tests:
- Added TestDirectApplyWriteOpts_AlwaysSync asserting the direct path
  always resolves to pebble.Sync even when the FSM-apply mode is
  nosync.
- Added TestDirectApplyMutations_NoSyncConfigured_StillWritesDurably
  for functional coverage of the public entry points under a
  NoSync-configured store.
- Existing sync-mode tests + benchmark renamed to *Raft to reflect
  that the knob now only governs the raft-apply path.

Addresses codex P2 on #592.

@coderabbitai
coderabbitai Bot commented Apr 23, 2026


📥 Commits

Reviewing files that changed from the base of the PR and between 0fb7db7 and 343e870.

📒 Files selected for processing (8)
  • kv/fsm.go
  • kv/leader_routed_store.go
  • kv/shard_store.go
  • store/lsm_store.go
  • store/lsm_store_sync_mode_benchmark_test.go
  • store/lsm_store_sync_mode_test.go
  • store/mvcc_store.go
  • store/store.go

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Can't wait for the next one!


@bootjp bootjp enabled auto-merge April 23, 2026 17:37
@bootjp bootjp merged commit d9afba2 into main Apr 23, 2026
8 checks passed
@bootjp bootjp deleted the perf/fsm-apply-group-commit branch April 23, 2026 17:43
bootjp added a commit that referenced this pull request Apr 25, 2026
…ni medium)

Address two PR #636 review concerns:

Codex P2 — `4 × 1 MiB = 4 MiB` exactly is *over* `MaxSizePerMsg` once
protobuf framing overhead is counted. Each pb.Mutation carries the
Op tag, Key tag + bytes, and Value length prefix; the pb.Request
envelope wraps them; marshalRaftCommand prepends one byte. With a
normal-length objectKey the encoded entry is ~4 MiB + 300 B, with a
1 KiB objectKey it grows to ~4 MiB + 4 KiB. Either way the entry
falls into etcd/raft's util.go:limitSize oversized-first-entry path
(the documented "as an exception" branch), bypassing the
`MaxInflight × MaxSizePerMsg` per-peer memory bound the prior commit
was trying to enforce.

Fix: lower s3ChunkBatchOps from 4 to 3. `3 × 1 MiB` encodes to 3 MiB
plus a few hundred bytes, leaving ~1 MiB of headroom even for
kilobyte-scale keys, so the entry stays strictly under MaxSizePerMsg
and the batched-MsgApp path applies. Per-PUT Raft commit count grows
~5× from the pre-PR-#636 baseline (5 GiB PUT: 320 → ~1707 entries),
but each fsync is ~5× smaller; the WAL group commit that landed in
PR #600 absorbs the rate.

Gemini medium — the previous commit also forced the cleanup paths
(cleanupPartBlobsAsync, deleteByPrefix, cleanupManifestBlobs,
appendPartBlobKeys) onto the same 4-op batch even though they carry
no chunk payload. Those ops are pure-key Dels; a 4-op batch
amplifies cleanup latency for large objects (5 GiB cleanup → ~1707
batches at the data-write size).

Fix: introduce s3MetaBatchOps = 64. With BlobKey-shaped keys
~100 B each, 64 keys × 100 B ≈ 6 KiB per batch, still three orders
of magnitude under MaxSizePerMsg. A 5 GiB-object cleanup commits
~80 batches instead of ~1707, so orphaned-blob garbage collection
finishes proportionally faster and does not amplify Raft load
relative to the data-write path. Updates the cleanup call sites
(adapter/s3.go:1879, 1900, 1933, 2206, 2227) plus the appendPartBlobKeys
loop helper.

Tests:

- TestS3ChunkBatchFitsInRaftMaxSize builds a worst-case PutObject
  batch (1 KiB objectKey, fully-populated 1 MiB chunk values),
  marshals via the same protobuf path the real Dispatch uses, and
  asserts the encoded entry plus the marshalRaftCommand 1-byte
  framing prefix is strictly under MaxSizePerMsg = 4 MiB. A second
  assertion pins the headroom margin at >= 64 KiB so a future bump
  in s3ChunkBatchOps or s3ChunkSize is caught at PR time.
- TestS3MetaBatchFitsInRaftMaxSize is the equivalent invariant for
  s3MetaBatchOps on the cleanup path.

Build / vet / lint clean. S3 unit tests pass.
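The headroom arithmetic can be sanity-checked with a rough size model; the per-mutation framing overhead below is an assumed round figure, not the measured protobuf encoding (the in-repo test marshals the real thing):

```go
package main

import "fmt"

// Rough size model for one raft entry carrying an S3 chunk batch.
// Constants mirror the values discussed in the commit message; the
// perMutOverhead figure is an illustrative assumption.
const (
	maxSizePerMsg   = 4 << 20 // 4 MiB (PR #593)
	s3ChunkSize     = 1 << 20 // 1 MiB chunk values
	s3ChunkBatchOps = 3       // lowered from 4 in this commit
	objectKeyLen    = 1024    // worst-case 1 KiB object key
	perMutOverhead  = 64      // assumed tag/length framing per pb.Mutation
	envelopeBytes   = 1       // marshalRaftCommand prefix byte
)

func main() {
	entry := envelopeBytes +
		s3ChunkBatchOps*(s3ChunkSize+objectKeyLen+perMutOverhead)
	headroom := maxSizePerMsg - entry
	fmt.Printf("entry=%d headroom=%d\n", entry, headroom)
	// With batchOps=3 the entry stays well under MaxSizePerMsg, so
	// etcd/raft's oversized-first-entry exception never triggers.
	fmt.Println("fits:", entry < maxSizePerMsg)
}
```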
bootjp added a commit that referenced this pull request Apr 25, 2026
…s=4 (#636)

## Summary

Aligns the S3 PutObject Raft entry size with `MaxSizePerMsg` (4 MiB
since PR #593).

**Change:** `s3ChunkBatchOps = 16 → 4`, so that one Raft entry =
`s3ChunkBatchOps × s3ChunkSize ≈ 4 MiB`. A one-line diff.

## Why

`etcd/raft`'s `util.go:limitSize` has a documented exception: a single
entry that exceeds `MaxSizePerMsg` is not rejected but is carried alone
in its own MsgApp (`"if the size of the first entry exceeds maxSize, a
non-empty slice with just this entry is returned"`). A 16 MiB entry
slips straight through on this path, so:

| Item | s3ChunkBatchOps=16 | s3ChunkBatchOps=4 (this PR) |
|---|---|---|
| 1 Raft entry size | ~16 MiB | ~4 MiB |
| Leader worst-case / peer | 1024 × 16 MiB = **16 GiB** | 1024 × 4 MiB = **4 GiB** |
| 3-node cluster (leader) | 32 GiB | 8 GiB |
| WAL fsync per entry | 16 MiB | 4 MiB |
| Raft commits for a 5 GiB PUT | 320 | 1280 (4×) |

The `1024 × 4 MiB = 4 GiB / peer` memory bound claimed in PR #593's
description assumed small entries and did not hold on the S3 path.
With this PR, S3 rides the normal batched-MsgApp path and the bound
becomes accurate even with S3 traffic included.

The Raft commit count quadruples, but each fsync is 4× smaller, so
this should be offset by the WAL group commit from PR #600 and
end-to-end throughput should be roughly unchanged.

## Test plan

- [x] `go build ./...` clean
- [x] `golangci-lint run ./...` 0 issues
- [x] `go test ./adapter/ -short -run 'TestS3|S3Server'` pass
- [ ] CI

## Follow-ups (design docs in separate PRs)

1. **S3 admission control** — hard-cap the bytes of concurrently
in-flight S3 PUT bodies so that leader-side memory stays under
`4 GiB × peers` even as client parallelism grows.
2. **Raft snapshot strategy / blob bypass** — a 5 GiB blob riding
along on a snapshot transfer when a follower falls behind is
undesirable, so move blobs off the Raft path and replicate only the
manifest through raft.
/gemini review
@codex review
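The oversized-entry exception can be sketched with a simplified stand-in for etcd/raft's `limitSize` (sizes in bytes; the real function operates on `[]raftpb.Entry` and counts encoded sizes):

```go
package main

import "fmt"

// limitSize mimics the documented etcd/raft behavior: entries are
// packed into one message up to maxSize, but if even the FIRST entry
// exceeds maxSize it is returned alone rather than rejected.
func limitSize(sizes []uint64, maxSize uint64) []uint64 {
	if len(sizes) == 0 {
		return nil
	}
	total := sizes[0]
	n := 1 // the first entry is always included ("as an exception")
	for _, s := range sizes[1:] {
		if total+s > maxSize {
			break
		}
		total += s
		n++
	}
	return sizes[:n]
}

func main() {
	const MiB = 1 << 20
	// s3ChunkBatchOps=16: each entry is ~16 MiB, over the 4 MiB cap,
	// so every MsgApp carries exactly one oversized entry.
	fmt.Println(len(limitSize([]uint64{16 * MiB, 16 * MiB}, 4*MiB)))
	// s3ChunkBatchOps=4: ~4 MiB entries ride the normal batched path,
	// and smaller entries pack together under the cap.
	fmt.Println(len(limitSize([]uint64{1 * MiB, 1 * MiB, 1 * MiB}, 4*MiB)))
}
```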
bootjp added a commit that referenced this pull request Apr 25, 2026
## Summary

As a follow-up to PR #636 (aligning Raft entries with
`MaxSizePerMsg = 4 MiB` via `s3ChunkBatchOps = 4`), this submits two
design docs. Docs only, no code.

### 1. `docs/design/2026_04_25_proposed_s3_admission_control.md`

**Problem**: PR #636 fixes per-entry memory accounting, but the
aggregate (= concurrent PUT count × in-flight bytes per PUT) is
unbounded as client parallelism grows.
**Proposal**: **hard-cap, behind a shared semaphore**, the S3 PUT body
bytes a node accepts before committing to Raft.
- (A) pre-charge the request entry from `Content-Length`
- (B) sub-lease per 4 MiB batch in `flushBatch` —
answer 503 SlowDown once `dispatchAdmissionTimeout` elapses
- default 256 MiB cap (~14% of GOMEMLIMIT=1800 MiB), tunable via env
- 503 with `Retry-After` → the AWS SDK retries automatically
- staged rollout M1–M4, including metrics / Grafana panels

### 2. `docs/design/2026_04_25_proposed_s3_raft_blob_offload.md`

**Problem**: a 5 GiB PUT is written wholesale into the WAL, so
follower catch-up and snapshot transfer scale O(stored bytes).
**Proposal**: Raft carries only `ChunkRef{SHA256, Size}` (~100 B) plus
the manifest. Chunk bodies are written directly to local Pebble under
a separate key namespace `!s3|chunkblob|<SHA>`; when a follower sees a
`chunkref` in applyLoop, it **fetches the body from a peer via an
asynchronous fetch RPC**.
- Raft load for a 1 GiB PUT: compressed to ~100 KiB (a 10⁴× reduction)
- Snapshot size becomes O(manifest count)
- GC via reference counts plus a grace period
- opt-in via `ELASTICKV_S3_BLOB_OFFLOAD=true`; the legacy `BlobKey`
path keeps running alongside
- five-stage rollout from an M0 spike to an M4 migrator

Both docs:
- follow the filename / header conventions of `docs/design/README.md`
(Status: Proposed / Date: 2026-04-25)
- cross-reference the related PRs (#593, #600, #612, #617, #636) and
existing designs (workload isolation, raft grpc streaming)
- state risks and open questions explicitly

## Test plan

- [x] markdown lint clean (no textlint configured; checked manually)
- [x] filenames follow the `YYYY_MM_DD_proposed_<slug>.md` convention
- [x] links to existing designs consistent with docs/design/README.md

Review focus on the design direction (the admission-control cap value,
the blob-offload content-addressing unit, and the M0–M4 milestone
split) would be most appreciated.

/gemini review
@codex review


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Documentation
* Added design proposal for S3 PUT request admission control and load
management strategy
* Added design proposal for S3 object storage optimization approach

<!-- end of auto-generated comment: release notes by coderabbit.ai -->